chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile) #2400

Open
donriddo wants to merge 41 commits into tetherto:main from donriddo:feat/benchmark-perf-llm-suite

Conversation

@donriddo

@donriddo donriddo commented Jun 2, 2026

Contributor

🎯 What problem does this PR solve?

The WB team needs performance numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and Qwen3.5-2B across five quantizations (Q4_0, Q4_1, Q4_K_M, Q6_K, Q8_0) and reasoning budgets -1 and 0, on both desktop and mobile, with mobile additionally swept across KV-cache types. It also needs a way to catch regressions between addon versions.

📝 How does it solve it?

Coverage

  • Models: Qwen3.5-0.8B + 2B (5 quants each), keep Qwen3-1.7B as a desktop comparison baseline, drop Qwen3-4B. No PyTorch.
  • Reasoning budget -1 and 0; single ~512-token prompt (verified at ~518 templated tokens against the Qwen3.5 tokenizer).
  • Mobile KV-cache types f16, q8_0, q4_0, plus TurboQuant/PolarQuant (tbq3_0/pq3_0, tbq4_0/pq4_0, pq3_0, pq4_0); desktop runs GPU, mobile runs both gpu and cpu.

Report — unified renderer (render-report.js), one identical table per device (desktop + 5 mobile):

  • Columns: TTFT (ms) · TPS · ppTPS · Tokens, each as mean ± stddev across repeats (desktop 5, mobile 3).
  • Header records addon version, prompt size, repeats, and the detected desktop GPU — version + GPU are stamped into the run's artifacts so they're accurate and survive a later re-render.
  • Crashed rows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU). These combos are still run; the crash is detected and reported rather than skipped.
  • Best configuration per device (highest TPS, highest ppTPS).
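
The per-cell `mean ± stddev` and the best-configuration pick described above can be sketched as follows. This is a minimal illustration, not the code in render-report.js: the row shape, the function names, and the choice of the sample (n - 1) standard deviation are all assumptions, since the PR does not state which estimator the renderer uses.

```javascript
// Sketch only: sample standard deviation (n - 1 denominator) is an assumption.
function meanStd (values) {
  const n = values.length
  if (n === 0) return { mean: NaN, stddev: NaN }
  const mean = values.reduce((a, b) => a + b, 0) / n
  if (n === 1) return { mean, stddev: 0 }
  const sumSq = values.reduce((acc, v) => acc + (v - mean) ** 2, 0)
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) }
}

// Format one table cell from the per-repeat measurements.
function formatCell (values, digits = 1) {
  const { mean, stddev } = meanStd(values)
  return `${mean.toFixed(digits)} ± ${stddev.toFixed(digits)}`
}

// Hypothetical row shape: { config, tps: [...], pptps: [...], crashed }.
// Crashed rows are never candidates for "best configuration".
function bestBy (rows, metric) {
  let best = null
  for (const row of rows) {
    if (row.crashed) continue
    const { mean } = meanStd(row[metric])
    if (best === null || mean > best.mean) best = { config: row.config, mean }
  }
  return best
}
```

Calling `bestBy(rows, 'tps')` and `bestBy(rows, 'pptps')` on one device's rows would give the two "best configuration" entries for that device.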

Cross-run comparison (regression detection)

  • summarize_only re-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.
  • compare_run_id adds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.
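
A rough sketch of how the Δ columns could be derived from two runs' rows. The row shape (`device`, `config`, and mean values for `ttft`, `tps`, `pptps`) and the helper names are hypothetical; the actual renderer may key and store rows differently.

```javascript
// Hypothetical row shape: { device, config, ttft, tps, pptps } holding mean values.
function keyOf (row) {
  return `${row.device}|${row.config}`
}

// Attach a delta object (current minus baseline) to every current row.
// Rows with no matching baseline entry get delta: null instead of zeros.
function withDeltas (currentRows, baselineRows) {
  const baseline = new Map(baselineRows.map(r => [keyOf(r), r]))
  return currentRows.map(row => {
    const base = baseline.get(keyOf(row))
    if (!base) return { ...row, delta: null }
    const delta = {}
    for (const metric of ['ttft', 'tps', 'pptps']) {
      delta[metric] = row[metric] - base[metric]
    }
    return { ...row, delta }
  })
}
```

When reading the result, note the sign convention differs by metric: a negative Δ TTFT is an improvement, while a negative Δ TPS or Δ ppTPS is a regression.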

Mobile execution

  • Sharded into one group per (model variant × KV-cache type): 2 sizes × 5 quants × 7 KV-cache types = 70 shards, run as 7 sequential KV-cache batches (10 shards each) to fit the Device Farm per-test ceiling and avoid pool/disk exhaustion. 3 measured repetitions per config.
  • The 70 shard files and the workflow's test_groups are generated from one source of truth (test/integration/_benchmark-matrix.js) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.
  • Deliberately absent from test-groups.json; scheduled only via the workflow's test_groups override.
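
To make the 2 × 5 × 7 arithmetic and the drift check concrete, here is a small sketch. It is not the contents of test/integration/_benchmark-matrix.js: the shard id format and the function names are made up for illustration, and only the axis values are taken from the description above.

```javascript
// Axis values as listed in this PR description.
const SIZES = ['0.8B', '2B']
const QUANTS = ['Q4_0', 'Q4_1', 'Q4_K_M', 'Q6_K', 'Q8_0']
const KV_TYPES = ['f16', 'q8_0', 'q4_0', 'tbq3_0/pq3_0', 'tbq4_0/pq4_0', 'pq3_0', 'pq4_0']

// One shard per (size, quant, KV-cache type). The id format is hypothetical.
function expectedShards () {
  const shards = []
  for (const kv of KV_TYPES) {
    for (const size of SIZES) {
      for (const quant of QUANTS) {
        const id = `qwen35-${size}-${quant}-kv-${kv.replace('/', '-')}`
        shards.push({ id, size, quant, kv })
      }
    }
  }
  return shards
}

// Group shards into one batch per KV-cache type (7 batches of 10).
function batchesByKv (shards) {
  const batches = new Map()
  for (const shard of shards) {
    if (!batches.has(shard.kv)) batches.set(shard.kv, [])
    batches.get(shard.kv).push(shard)
  }
  return batches
}

// Compare the ids the matrix expects with the ids actually present.
function findDrift (expectedIds, actualIds) {
  const expected = new Set(expectedIds)
  const actual = new Set(actualIds)
  return {
    missing: expectedIds.filter(id => !actual.has(id)),
    extra: actualIds.filter(id => !expected.has(id))
  }
}
```

A CI check in this spirit would fail the job whenever `missing` or `extra` is non-empty, which is what prevents a run against a stale or partial shard set.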

Workflow inputs (no per-run configurability of the matrix — it's fixed in the scripts):
ref, run_desktop, run_mobile, summarize_only, artifact_run_id, compare_run_id. The shared integration-mobile-test-llm-llamacpp.yml gains two additive optional inputs (job_timeout_minutes default 120, artifact_suffix default empty) — backward-compatible for other addon callers.

🧪 How was it tested?

  • npx standard clean; validate-mobile-tests.js in sync; verify:benchmark-shards confirms the matrix, the generated integration.auto.cjs (shard-file refs and run-function names), and the workflow test_groups are all in lockstep, so a generator change can't silently desync the Device Farm grep.
  • Generation pipeline verified locally: from a fresh checkout (shards absent) test:integration:generate regenerates everything with zero drift in the committed integration.auto.cjs; the mobile-only benchmark shards skip cleanly on desktop.
  • Validated end-to-end across every input combination with real runs (full, desktop-only, mobile-only, re-render, comparison).
    • One full run — desktop + the complete 70-shard mobile matrix in a single pass: https://github.com/tetherto/qvac/actions/runs/27490240383
      • Desktop sweep on the self-hosted GPU (Desktop (NVIDIA RTX 4000 SFF Ada Generation), desktop=5).
      • Mobile 70-shard matrix: 2 sizes × 5 quants × 7 KV-cache types (incl. TurboQuant/PolarQuant), mobile=3 with mean ± stddev and best-config per device. iPhone 16/17 report the full 70/70; combos that a device's GPU doesn't support (quantized KV on Adreno, TurboQuant on Metal) surface as Crashed rows / coverage gaps, as intended.
    • Per-batch wall-clock ran 54–114 min under the 180-min cap; the 7 KV-cache batches run sequentially.

💥 Known findings from the runs (data, not code issues)

  • Adreno GPU (Samsung S25/S26) crashes on all gpu + kv=q4_0 and gpu + kv=q8_0 — confirmed and reported as Crashed. CPU path handles quantized KV fine.
  • Mobile thermal throttling: on some mobile configs successive repeats get slower (e.g. ppTPS 850 -> 492 -> 428 across 3 reps), which widens the ± stddev on those rows. This is genuine sustained-load throttling on real devices, not measurement error — the stddev reflects it honestly.
  • Pixel 9 Pro GPU TTFT/ppTPS are notably weaker than the other devices across quants, consistent with a Vulkan/driver characteristic; CPU results are plausible.
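
One way to tell the throttling pattern above apart from ordinary run-to-run noise is to check whether the repeats decline strictly in order. The helper below is a hypothetical illustration, not part of the benchmark suite; it uses the ppTPS example quoted in the findings.

```javascript
// True when every repeat is slower than the one before it.
function isMonotonicDecline (reps) {
  return reps.length > 1 && reps.every((v, i) => i === 0 || v < reps[i - 1])
}

// Fraction of throughput lost between the first and the last repeat.
function dropFraction (reps) {
  return (reps[0] - reps[reps.length - 1]) / reps[0]
}

const pptps = [850, 492, 428] // example from the findings above
const throttled = isMonotonicDecline(pptps)
const drop = dropFraction(pptps) // (850 - 428) / 850, roughly half
```

With only three repeats a strict decline can also happen by chance, so a check like this is a hint to look at the row, not proof of throttling.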

📦 Notes

  • Benchmark/test infrastructure only — no addon index.js/native or public-API change, so no version bump or CHANGELOG entry ([skiplog]).
  • Pairs with #2382 (workflow infra, already merged).

@gianni-cor

gianni-cor commented Jun 2, 2026

Contributor

just for mobile, can you run the bench on both CPU and GPU?

@github-actions

github-actions Bot commented Jun 2, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

@donriddo

donriddo commented Jun 2, 2026

Contributor Author

> just for mobile, can you run the bench on both CPU and GPU?

Already does. mobile.config.json sets "devices": ["gpu", "cpu"] and benchmark-perf.test.js loops over both, so each model and quant runs on CPU and GPU.

@donriddo

This comment was marked as resolved.

donriddo added 3 commits June 10, 2026 18:20
…-llm-suite

# Conflicts:
#	.github/workflows/integration-mobile-test-llm-llamacpp.yml
A comparison requested via compare_run_id renders delta columns against a
baseline run. When the baseline produced no benchmark rows (e.g. only its
run-meta/desktop-meta metadata artifacts were downloaded), the comparison was
silently empty: the report rendered with no deltas and the job went green even
though the requested comparison was never produced. render-report.js now exits
non-zero when compareDir is set but the baseline has zero rows. This is distinct
from a baseline that has rows but none matching the current devices, which still
renders a per-device note.
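
The empty-baseline guard described in this commit message could look roughly like the sketch below. The function name, its argument shape, and the message text are assumptions; only the rule (fail when `compareDir` is set and the baseline has zero rows) comes from the commit message.

```javascript
// Decide whether a requested comparison can be rendered.
// compareDir: baseline directory, or null when no comparison was requested.
// baselineRows: benchmark rows parsed from the baseline artifacts.
function checkBaseline ({ compareDir, baselineRows }) {
  if (compareDir && baselineRows.length === 0) {
    return {
      ok: false,
      exitCode: 1,
      message: `comparison requested but baseline in ${compareDir} has no benchmark rows`
    }
  }
  // Rows present but none matching the current devices is NOT an error here;
  // that case is handled later with a per-device note.
  return { ok: true, exitCode: 0, message: '' }
}
```

A caller would typically print `message` to stderr and set `process.exitCode = result.exitCode`, so the job goes red instead of silently rendering a report with no deltas.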

The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.
maxim-smotrov previously approved these changes Jun 10, 2026
jesusmb1995 previously approved these changes Jun 11, 2026

@jesusmb1995 jesusmb1995 left a comment

Contributor

The only thing missing is a possible .html report; that can maybe be done in a follow-up.

donriddo added 3 commits June 11, 2026 11:12
The consolidated report is now over a thousand rows, which is hard to scan.
render-report.js gains two visual outputs:
- A Charts section embedded in the markdown as Mermaid xychart bars, so a device
  throughput ranking and the KV-cache / quantization comparison for the fastest
  device render inline in the GitHub step summary.
- A --html output that writes a self-contained file (inline SVG, no deps or CDN)
  with the full per-device grouped charts and stddev error bars.

The summarize job emits both; the markdown points viewers to the HTML file
(uploaded with the report artifacts) for the full per-device view.

Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both
reasoning-budget values on it. The warm-up was inside that reasoning-budget
loop, so every backend warmed up twice. But the warm-up only primes the GPU
kernels/caches for the loaded model, which the reasoning budget (a per-call
generation param) does not change, so the second warm-up was pure overhead
(~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three
measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
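
The loop structure after this change can be sketched as below. It is a simplified, synchronous stand-in for the real test (which would be async and call the addon); `loadModel`, `warmUp`, and `measure` are hypothetical callbacks injected so the shape of the loops is the only thing on display.

```javascript
// Sketch: warm up once per backend, outside the reasoning-budget loop.
function runShard ({ backends, budgets, repeats, loadModel, warmUp, measure }) {
  const results = []
  for (const backend of backends) {
    const model = loadModel(backend) // one model load per backend
    warmUp(model) // once per backend; the budget is a per-call param
    for (const budget of budgets) {
      for (let rep = 0; rep < repeats; rep++) {
        results.push({ backend, budget, rep, ...measure(model, budget) })
      }
    }
  }
  return results
}
```

With two backends, two budgets, and three repeats this performs 2 warm-ups and 12 measured runs, where the earlier placement inside the budget loop would have performed 4 warm-ups for the same 12 measurements.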
jesusmb1995 previously approved these changes Jun 11, 2026
jpgaribotti previously approved these changes Jun 11, 2026
The Stamp desktop device step interpolated the nvidia-smi GPU name directly
into the printf inside its run block. Route it through a GPU_NAME env var so
the value reaches the shell as data rather than as expanded workflow syntax,
matching the env-mapping already used for the dispatch inputs elsewhere in
this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform
across every step.
maxim-smotrov previously approved these changes Jun 11, 2026
@donriddo

Contributor Author

/review

donriddo added 12 commits June 12, 2026 12:47
The mobile chart helpers averaged a metric over every row sharing a
(device, category) key, so a single bar blended both backends (gpu and
cpu), both model sizes and both reasoning budgets — a value no real
configuration produced — and its stddev whisker was the spread across
those blended configs, not the measured 3-rep noise.

Charts now hold every axis but the one on the x-axis at a fixed value
(size 2B, reasoning budget -1, and the non-varied categorical at its
default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization
chart), so each bar is one measured configuration and its whisker is that
config's own 3-rep stddev. gpu and cpu are charted separately and never
blended, with a shared y-scale per metric. The inline mermaid is reduced
to one device-ranking chart at a single stated config. Crashed configs
remain missing bars rather than zeros, and the download note now names the
real artifact (qwen35-benchmark-findings) and the file inside it.
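
The "one bar is one measured configuration" rule from this commit message can be illustrated with a small filter. The row fields and the `barsFor` name are hypothetical; the fixed values (size 2B, reasoning budget -1, weights Q4_K_M, KV f16) are the ones stated above.

```javascript
// Values every non-charted axis is pinned to, per the commit message.
const FIXED = { size: '2B', budget: -1, quant: 'Q4_K_M', kv: 'f16' }

// Select the bars for one chart: one device, one backend, one varying axis.
// Hypothetical row shape: { device, backend, size, budget, quant, kv, mean, stddev, crashed }.
function barsFor (rows, { device, backend, xAxis }) {
  return rows
    .filter(r => r.device === device && r.backend === backend && !r.crashed)
    .filter(r => Object.keys(FIXED).every(k => k === xAxis || r[k] === FIXED[k]))
    .map(r => ({ label: r[xAxis], mean: r.mean, stddev: r.stddev }))
}
```

Because crashed rows are filtered out rather than mapped to zero, an unsupported combo shows up as a missing bar, and because `backend` is part of the selection, gpu and cpu never share a bar.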

Coverage compared the reported shards against the renderer's CURRENT
matrix, so re-rendering an older run after the matrix grew showed it as
falsely incomplete: a complete 30-shard run read 30/70 against today's
70-shard matrix.

The stamp-version job now records the run's expected shard list into
run-meta.json alongside the addon version, and coverage scores against
that stamped list when present, falling back to the live matrix only for
runs that predate the stamp. A re-render of a stamped run is therefore
always scored against the matrix it actually targeted, while genuinely
missing shards are still surfaced.
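
A sketch of the stamped-list fallback described in this commit message. The `expectedShards` field name inside run-meta.json and the return shape are assumptions made for illustration.

```javascript
// Score coverage against the shard list stamped into the run's metadata,
// falling back to the live matrix only when the run predates the stamp.
// runMeta.expectedShards is a hypothetical field name.
function coverage (reportedShards, runMeta, liveMatrix) {
  const stamped = runMeta && Array.isArray(runMeta.expectedShards)
    ? runMeta.expectedShards
    : null
  const expected = stamped || liveMatrix
  const reported = new Set(reportedShards)
  const missing = expected.filter(id => !reported.has(id))
  return {
    done: expected.length - missing.length,
    total: expected.length,
    missing,
    source: stamped ? 'stamp' : 'live-matrix'
  }
}
```

Under this scheme a complete 30-shard run re-rendered today scores 30/30 against its own stamp, while an unstamped run with the same 30 shards would still read 30/70 against the current matrix.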

The report's chart note told readers to open qwen35-benchmark-charts.html
but gave no link, so they had to scroll to the run's artifacts section and
download it by hand.

The renderer now takes an optional --charts-url and, when given, renders the
artifact mention as a markdown link. The summarize job uploads the report
first so the artifact's download URL is known, then substitutes that URL into
the note before writing the run summary (falling back to the run page URL if
the upload yields none). Local renders pass no URL and keep the plain text,
so there is never a dangling link.
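
The optional-link behaviour could be sketched like this. The note wording and both function names are illustrative; only the `--charts-url` flag and the file and artifact names come from the commit messages.

```javascript
// Read the value following --charts-url, or null when the flag is absent.
function parseChartsUrl (argv) {
  const i = argv.indexOf('--charts-url')
  return i !== -1 && argv[i + 1] ? argv[i + 1] : null
}

// Render the chart note: a markdown link when a URL is known, plain text otherwise.
function chartsNote ({ chartsUrl } = {}) {
  const file = 'qwen35-benchmark-charts.html'
  const target = chartsUrl ? `[${file}](${chartsUrl})` : file
  return `Full per-device charts: ${target} (in the qwen35-benchmark-findings artifact).`
}
```

A local render would call `chartsNote({ chartsUrl: parseChartsUrl(process.argv.slice(2)) })`; with no flag passed, the note stays plain text and no dangling link is produced.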
…ebuilds

The desktop sweep ran on the GitHub-hosted GPU runner and built the addon
from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant
for ephemeral runners — destructive on a shared persistent runner.

Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration
tests use, and download the linux-x64 binary the prebuild job already produces
instead of compiling on the runner. This adds the Manual Workspace Cleanup
self-hosted runners need and drops the source build, the destructive disk
cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for
desktop-only dispatches so the binary is available to download.

The summarize job fetched the report artifacts with actions/download-artifact,
which verifies the artifact digest and was failing with `digest-mismatch` on
otherwise-intact artifacts (the gh CLI downloads the same files without issue).
Under continue-on-error that left the input directory silently empty, so the
render step reported a misleading "no benchmark reports found" and exited.

Switch the current-run and baseline downloads to `gh run download`, which pulls
the artifacts by name prefix and run id without the digest check, and emits a
warning rather than masking a real failure.

The summarize job downloads the report artifacts with `gh run download`, which
calls the Actions artifacts API and needs actions:read. The job only granted
contents:read, so the download returned nothing and the render step reported
"no benchmark reports found". Add actions:read.
…-llm-suite

# Conflicts:
#	.github/workflows/integration-mobile-test-llm-llamacpp.yml

prebuild now needs verify-shards, so a benchmark-shard matrix drift fails the run in ~30s instead of after the expensive prebuild. !cancelled() + the result check keep a desktop-only run (where verify-shards is skipped) working. The verify-shards comment is corrected to match.
Labels

NLP llm and embed verified Authorize secrets / label-gate in PR workflows

5 participants